An Introduction to HyperFlex

Cisco was founded in 1984 and has since grown to become a leader in networking, data center, and collaboration. They attempted to enter the storage space in 2013 with the purchase of Whiptail, a solid-state storage company. Rebranded as Cisco UCS Invicta, the product was unsuccessful and was discontinued in 2015.

Later in 2015, Cisco established OEM agreements with Springpath, a hyperconvergence software company. In rather effective secrecy, Cisco worked closely with Springpath to develop, test, and release a full-fledged hyperconverged solution powered by Cisco UCS servers and Fabric Interconnects along with Springpath software (rebranded as the HyperFlex Data Platform).

From what we can tell so far, HyperFlex falls neatly into the category of HCI, defined by AHEAD as “A combination of server, storage, and hypervisor into a node-based architecture with node-based scalability leveraging software-defined storage and providing simplified management with advanced infrastructure automation.”

In order to provide customers with a solution that offers all the benefits of a traditional storage array without the common complexity and scalability issues, Cisco has combined their established UCS blade and rack-mount offering with the HyperFlex Data Platform. Competitors have already been doing this for years, but Cisco brings one unique element that no other vendor in this space offers: fabric interconnects to aggregate network connections.

The result is that a HyperFlex solution only really needs two 1Gb links to northbound switches to get started. More importantly, it removes the question that most HCI customers ask: “How should I aggregate all these 10Gb connections?”

HyperFlex Details

Architecture and Features

  • Product Range
    • In the first release, the solution is expected to only support virtual workloads, specifically running VMware vSphere.
    • The product at launch scales reasonably well, supporting several thousand VMs on a fully loaded four-cluster deployment (32 storage nodes and 16 compute nodes, all managed through a single vCenter plugin).
    • Incompatible workloads include those that require less than 5 ms of latency, those that need more SSD capacity than the cache available in the system, data sets that are not deduplication- or compression-friendly, and mission-critical apps.
  • Scalability Interdependence 
    • HyperFlex allows customers to add B200 blades as compute-only nodes inside of a HyperFlex cluster. There is a limitation, though; currently you can’t have more than 4 compute nodes in a cluster, but it’s a great start.
    • No information yet on whether there will be storage-only nodes, but it should be technically possible, and it will likely happen if customers require it.
  • Flexibility
    • HyperFlex currently only supports VMware, but I expect to see multi-hypervisor support as early as this year.
    • You can change the combination of CPU and RAM on the nodes, and they don’t have to match each other within a cluster, but the storage configuration has to be identical across nodes. In other words, today you can’t have a cluster composed of nodes that have different hard drive sizes (a quick validation sketch follows this list).
    • Storage in the HyperFlex cluster can only currently be consumed by HyperFlex nodes.
      • Due to the underlying technology, it should be possible to present storage from the cluster to bare-metal servers other than B200 blades, so if this is a strong requirement, I assume that Cisco will try to make it happen.
    • HyperFlex does not support encryption today, but it is expected to come very soon.
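
To make these composition rules concrete, here is a minimal validation sketch. This is purely illustrative code of my own, not anything Cisco ships, and the limits encoded in it come from the notes above (matching storage configurations across nodes, and compute-only nodes capped at 4 and at the number of HX nodes):

    from dataclasses import dataclass
    from typing import List

    # Illustrative limit taken from the notes above, not Cisco documentation.
    MAX_COMPUTE_ONLY = 4

    @dataclass
    class Node:
        model: str            # e.g. "HX220c", "HX240c", or "B200"
        capacity_drives: int  # capacity-tier drive count (0 for compute-only B200s)
        drive_size_tb: float  # capacity drive size; must match across storage nodes

    def validate_cluster(nodes: List[Node]) -> List[str]:
        """Return a list of rule violations; an empty list means the layout passes."""
        problems = []
        storage = [n for n in nodes if n.capacity_drives > 0]
        compute_only = [n for n in nodes if n.capacity_drives == 0]

        # Rule: storage configuration must be identical across nodes in a cluster.
        if len({(n.capacity_drives, n.drive_size_tb) for n in storage}) > 1:
            problems.append("storage nodes have mismatched drive configurations")

        # Rule: no more compute-only nodes than HX storage nodes, and no more than 4 today.
        if len(compute_only) > len(storage):
            problems.append("more B200 compute-only nodes than HX storage nodes")
        if len(compute_only) > MAX_COMPUTE_ONLY:
            problems.append(f"more than {MAX_COMPUTE_ONLY} compute-only nodes")
        return problems

    cluster = [Node("HX220c", 6, 1.2)] * 4 + [Node("B200", 0, 0.0)] * 2
    print(validate_cluster(cluster) or "cluster layout looks valid")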

Data Protection, Replication, High Availability

  • Data protection: HyperFlex supports snapshots both natively through the HyperFlex Data Platform and through VAAI integration. HyperFlex can also automate large clone operations, facilitating renaming, IP address changes, etc.
  • Replication Solutions: HyperFlex does not include site-to-site (active-passive) replication today, but it is on the roadmap. For now, Cisco has partnered with Zerto to fill this gap and has established them as a “certified” replication solution for HyperFlex.
  • Highly Available Solutions: HyperFlex does not currently include active-active, cross-data-center functionality out of the box. It’s not a critically important feature, and we aren’t sure when/whether this functionality will be released.

Primary Components

  • Compute
    • All HyperFlex solutions include a pair of Fabric Interconnect 6248s to connect everything together. Cisco decided not to include the newer 6300 series because, after extensive testing, they concluded that 40Gb ports were not yet required. These FIs are unlocked and available for customers to use for other needs beyond HyperFlex.
    • Cisco has modified UCS C220 (1RU) and UCS C240 (2RU) servers and rebranded them as the HX220c and HX240c. The HX220c is configured with (6) 1.2TB SAS drives, (1) 120GB SSD for the controller VM, and (1) 480GB SSD for read/write cache. The HX240c is similar, except that it has (15) 1.2TB drives and (1) 1.6TB SSD for read/write cache. Both nodes include SD cards for the hypervisor installation.
    • The HX220c/HX240c nodes serve as both storage and compute nodes, supporting up to 768GB of RAM and up to a pair of Intel E5-2699 v3 processors.
    • B200 servers can also participate in a cluster as compute-only nodes. Today you can only have as many B200s in a cluster as you have HX nodes.
    • Graphics acceleration cards are not currently supported, but should be in the future.
  • Storage Array
    • This is where HCI products shine: there is no dedicated storage array. Instead, storage is clustered across nodes in a redundant manner and presented back to each node, in this case via NFS. The HyperFlex Data Platform (powered by Springpath) is what provides this.
    • This “storage magic” is made possible by a virtual machine running on each HX220c/HX240c node. Local disks on each node are passed directly to the controller VM, which forms a cluster with the controller VMs on the other nodes. Note that RAID is not used, so there are no write penalties; instead, HyperFlex makes multiple copies of data across nodes to provide redundancy. Unlike Nutanix, Cisco is expected to more strongly recommend three copies of data instead of two, and also offers some interesting settings that give customers more granular control over failure scenarios (a rough usable-capacity sketch follows this list).
    • Deduplication and compression are both inline. Deduplication is not global (it operates on a per-node basis), and currently it cannot be turned off.
    • Cisco was quick to point out that they do not believe in “data locality,” a feature touted by Nutanix. Instead, HyperFlex spreads all data across all nodes equally, which they believe provides higher performance by engaging all drives during reads, as well as higher storage efficiency by balancing capacity more evenly across nodes.
    • HyperFlex includes a special VIB, IO Visor, which directs storage IO from the VMs to the clustered HyperFlex system. The benefit of this is that in the event of a controller VM failure, storage IO is not interrupted and the VMs on that storage node do not fail. This is unique in this space, and it is also how HyperFlex is able to support compute-only nodes.
  • Storage Fabric
    • HyperFlex leverages a pair of 10Gb ports provided by the VIC 1227 on the HX220c and HX240c nodes, which are connected to the fabric interconnects.
      • This is a big difference compared to other HCI providers, because Cisco effectively takes care of aggregation and goes a step further by building QoS and vSwitch settings from the factory all the way through the FIs.
      • We expect a future release to include the 6300 FIs for 40Gb support, although Cisco (and Nutanix, actually) don’t see a significant need for 40Gb connectivity yet.
  • Network
    • Any major 10Gb-capable switches can be used to connect to the FIs, and technically 1Gb could be enough depending on the workload.
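
Because redundancy comes from whole copies of data rather than parity, the usable-capacity math is straightforward. The back-of-the-envelope sketch below is mine; the drive counts come from the HX240c spec above, while the replication factors are illustrative and dedupe/compression gains are ignored:

    # Rough usable capacity for a copy-based redundancy scheme.
    # Drive counts are from the HX240c spec above; replication factor (RF)
    # values are illustrative, and dedupe/compression savings are ignored.

    def usable_tb(nodes: int, drives_per_node: int, drive_tb: float, rf: int) -> float:
        """Raw pool capacity divided by the number of data copies kept."""
        return nodes * drives_per_node * drive_tb / rf

    NODES, DRIVES, SIZE_TB = 4, 15, 1.2  # four HX240c nodes, (15) 1.2TB drives each
    raw = NODES * DRIVES * SIZE_TB
    for rf in (2, 3):
        print(f"RF={rf}: {usable_tb(NODES, DRIVES, SIZE_TB, rf):.0f} TB usable of {raw:.0f} TB raw")

With three copies, usable space is a third of raw, which is part of why the inline deduplication and compression described above matter so much to the economics of a three-copy configuration.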

Hypervisor Compatibility

  • Cisco HyperFlex currently supports vSphere; Hyper-V and KVM are on the roadmap.

Management Features

  • Centralized Visibility
    • Cisco is focusing on bringing cluster management into the hypervisor management tool (vCenter) rather than building a standalone management UI. Although there is a cluster IP address from which one can review data, a vCenter plugin allows you to view all major cluster data.
  • Centralized Management
    • The vCenter plugin gives you the ability to take many actions against the cluster, including creating/deleting/modifying datastores, creating and managing storage snapshots, and whitelisting/blacklisting/identifying drives.
      • In the future, I hope to see more tasks, such as cluster expansion, exposed through the vCenter plugin.
  • Automation
    • The initial product is not expected to have a significant amount of automation.
    • There is a tool that facilitates cluster creation and expansion, and the Springpath software upgrade is a simple, automated process. However, the solution does not upgrade vSphere for you today.
    • Unlike other HCI solutions, HyperFlex benefits from the fabric interconnects, which enable customers to easily manage firmware updates and server profiles.
      • I’ll be interested to see if those functions, currently managed through UCS Manager, will be brought into the HyperFlex plugin.
  • Monitoring and Alerting
    • Monitoring and alerting are aggregated into vCenter, which is great for most customers, who either rely on vCenter directly or use other tools that pull data from vCenter.
    • Hardware logging details, such as drive failures, will be available and can be alerted against through vCenter.
    • Logical alerts, such as low cluster disk space, will also be available in vCenter.
  • Integration
    • Integration into existing management tools is expected to continue into other hypervisors as they are released.

Final Thoughts 

Cisco has decided that they want to be a direct player in the $1.8B HCI market by combining a proven and highly popular server platform with newly OEMed software-defined storage. Cisco’s existing fan base of engineers who enjoy working with UCS, and are trained to do so, can be expected to strongly favor this solution over those of competitors. Cisco’s message, that they can provide all network, collaboration, data center, and now HCI solutions, is a powerful one, as most customers want fewer, not more, manufacturers in their data centers. The inclusion of fabric interconnects further strengthens the solution by addressing one of the challenges of traditional HCI. All these benefits, combined with aggressive pricing, could lead to a very successful product.

That being said, this is still a 1.0 product that has not yet been tested in the field. One of the key tenets of HCI is that it is simple and automated. This is important because HCI is fundamentally confusing for many: you have storage inside servers that is passed to controller VMs, which then create clustered storage and present it back to the same hosts they are running on. Nutanix has been successful by hiding and automating away all that complexity, and we will soon see whether HyperFlex is simple and automated enough to compete in this space.

Here are some follow-up questions (and some last-minute thoughts) that I hope to address over the next month:

  • Overall manageability – How easy or difficult is it to manage the solution in the field? Are the tools rock solid or buggy? In the Cisco lab, I managed to break the OVA tool used to deploy a cluster a few times, but that was due to me typing invalid IP address information (conflicting IP addresses between the ESX hosts and the virtual controllers). Little things like this aren’t too difficult to fix in the wizard, but they can quickly and negatively affect customer experience. With that said, I was testing against older code, so chances are that customers won’t have the opportunity to duplicate my mistakes.
  • Upgrades – How simple is it to upgrade the system, and is it actually upgraded as a system, or do end users have to manually update individual components? Cisco does tout some “single click” operations, but only for part of the system: Springpath software can be updated through a single click, but UCS Manager has to be used for the hardware elements, and vSphere is upgraded on its own.
  • Overall system performance – Cisco claims that their solution matched or beat Nutanix in almost all performance categories; it will be interesting to run the solution through a series of internal tests.
  • Deduplication ratios – This is a very difficult number to track because everyone’s data is different. With that said, I am curious as to how it compares to other hyperconverged solutions, as well as to dedicated arrays. From what I’ve gathered so far, Cisco doesn’t have any strong recommendations on this subject yet.
  • Storage controller overhead – The virtual controller is documented as requiring 48 GB to 72 GB of RAM and 8 vCPUs (10,800 MHz). Other manufacturers have grown their requirements over time as they add more features. I am curious to see how real these numbers are and whether they grow as more features are unlocked (a quick back-of-the-envelope calculation follows this list).
  • Flexibility – It was unclear when I first gained access to the product data whether you can select from a multitude of drive sizes when configuring these nodes or if you are limited to the 1.2TB SAS drives for the capacity tier.
  • Reliability – Cisco hasn’t had success as a storage company and doesn’t have significant experience with software-defined storage. It will be interesting to see whether Cisco can quickly release new features to reach parity with the competition without affecting stability and reliability.
  • Acquisition – Will Cisco acquire Springpath? If Cisco doesn’t acquire Springpath and instead, Springpath continues to sell and develop its own non-Cisco version of the software, will it lag behind competitors? And if Cisco does purchase Springpath, will Cisco be able to manage them without disrupting their success?
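
On the storage controller overhead question above, a quick back-of-the-envelope shows why these numbers matter. The controller figures are the documented ones from the bullet; the node totals assume a maximally configured HX node (768GB of RAM and a pair of 18-core E5-2699 v3 processors), and comparing vCPUs to physical cores is only a rough proxy:

    # Controller VM overhead as a share of a fully loaded HX node's resources.
    # Controller requirements (48-72 GB RAM, 8 vCPUs) are the documented figures
    # above; node totals assume 768 GB RAM and dual 18-core E5-2699 v3 CPUs.

    NODE_RAM_GB = 768
    NODE_CORES = 2 * 18  # physical cores; vCPU-to-core mapping is approximate

    for ctrl_ram_gb in (48, 72):
        print(f"{ctrl_ram_gb} GB controller RAM = {ctrl_ram_gb / NODE_RAM_GB:.1%} of node RAM")
    print(f"8 vCPUs = {8 / NODE_CORES:.1%} of physical cores")

On smaller memory configurations, of course, the controller’s share is proportionally larger, which is worth factoring into sizing.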